Abstract

When the training data in a two-class classification problem is overwhelmed by one 
class, most classification techniques fail to correctly identify the data points belonging to 
the underrepresented class. This paper proposes Similarity-based Imbalanced Classification 
(SBIC) that simultaneously optimizes the weights of the empirical similarity function and 
identifies the locations of absent data points, i.e. unobserved data points from the minority 
class. Similar to cost-sensitive approaches, SBIC operates on an algorithmic level to handle 
imbalanced structures and similar to synthetic data generation approaches, it utilizes the 
properties of unobserved data points. The results of applying the proposed method to
imbalanced datasets suggest that SBIC is comparable to, and in some cases outperforms,
other commonly used classification techniques for imbalanced datasets.

1 Introduction


Classification is the task of identifying class labels for features belonging to a specific set
of classes. A successful classification algorithm depends upon having a sufficient number
of samples for each class. For instance, in a two-class classification, if the training dataset
samples from one class significantly outnumber those from the second class, the classification
algorithm may fail to correctly identify the data points belonging to the minority
class, i.e. the class that has very few representatives in the training dataset. This type of
classification is termed imbalanced classification, to reflect the imbalanced nature of the
training dataset (He and Garcia, 2009). In practice, identifying the points belonging to
the minority class is more important (Byon et al., 2010), e.g., in warranty applications,
where the cost associated with not predicting a warranty claim in advance is much higher
than that of incorrectly predicting the possible occurrence of a claim.

Many attempts to address imbalanced classification rely on synthetic oversampling, 
which artificially creates extra data points from the minority class. An important limitation 
is that these techniques consider data generation and classification as two independent 
tasks. In other words, they only alter the dataset, not the algorithm. Overcoming the 
limitation requires generating the synthetic data (maybe implicitly) using a mechanism 
that accounts for the imbalanced nature of the original data. This type of implicit synthetic 
data generation is termed absent data generation (ADG) (Pourhabib et al., 2015). ADG
attempts to identify the locations of the synthetic data points which improve the algorithm's
performance but without necessarily generating the points themselves.

ADG, however, is restricted to the specific formulation of kernel Fisher discriminant
analysis (Mika et al., 1999). Thus, if absent data can only be utilized by using a discriminant
analysis as a base classifier, the inclusion of ADG may only marginally improve
the classification rate. In this paper we show that ADG can be extended beyond the
specific formulation of kernel Fisher discriminant analysis, which can help achieve better
classification for a larger class of datasets.

Similarity-based approaches define an empirical similarity function which assesses the
degree of resemblance between different inputs (Gilboa et al., 2006, 2011). The method
proposed in this paper uses the concept of similarity to locate absent data points. For
every new input, i.e., the test data point, our algorithm uses the weighted average of the 
training data points, where the weights are determined based on the empirical similarity 
function. 

We show how an empirical similarity function can identify the location of synthetic 
data having a high degree of similarity to the existing minority data points. To make the 
synthetic data useful, we impose constraints so that the new data points are close to the 
boundary of the two classes. The proposed algorithm simultaneously learns the location 
of absent data and the parameters of the similarity function. As such, it does not need 
to generate synthetic data, but instead utilizes the points to obtain a better classifier for 
imbalanced datasets. This paper makes two contributions to the literature on imbalanced
classification. First, it shows that absent data can be generated using a similarity function.
Second, the application of the proposed algorithm to real datasets demonstrates that it
is competitive with the state-of-the-art methods in imbalanced classification.

The remainder of this paper is organized as follows. Section 2 briefly reviews the
relevant literature. Section 3 reviews the concept of empirical similarity, formally defines
the problem, and presents our approach for imbalanced classification. Section 4 compares
the performance of the proposed algorithm with two commonly used techniques using real
datasets. The final section concludes and offers suggestions for future research.


2 Related Work 


We begin by assuming that the number of data points for one class, the minority, is either 
absolutely small, i.e., we have very few samples from that class, or is smaller relative to 
the second class, the majority. We call the former case absolute imbalance and the latter 
case relative imbalance, and briefly review the two streams of literature. 

An extreme idea for handling imbalanced datasets is to completely remove the minority
points in the training stage based on the understanding that a small number of samples from
the minority class may not be useful for identifying the boundary between the two classes.
Instead, the focus is identifying the tightest bound for the majority class (Park et al., 2010).
This class of approaches, known as novelty detection, can be useful when there are very
few data points from the minority class. Empirical results suggest, however, that for most
imbalanced datasets, especially if the dataset is relatively imbalanced, novelty detection
methods are inferior compared with methods that utilize the minority samples (Pourhabib
et al., 2015).


Resampling methods utilize the minority samples; for example, bootstrapping (Efron, 1982)
partially mitigates the effect of a low minority presence. In Galar et al. (2012), an ensemble
classifier takes advantage of having several datasets for learning. While these approaches
can be effective for some specific data structures (Byon et al., 2010; Chen et al., 2005),
resampling the information embedded in the location of a minority data point several times
may cause the classifier to overemphasize the region, thus introducing significant bias. In
addition, resampling techniques do not allow for "exploring" regions which do not have
any actual minority points.


The drawbacks of resampling have motivated the concept of synthetic oversampling (Chawla
et al., 2002; Han et al., 2005; Chen et al., 2010; Barua et al., 2014), which generates extra
data points based on the existing data in order to create an augmented, and less
imbalanced, dataset. Synthetic oversampling methods differ based on the mechanisms
they employ for data generation. For example, SMOTE (Chawla et al., 2002) uses linear
interpolation between existing minority data points to generate new samples, whereas
Borderline-SMOTE (Han et al., 2005) utilizes both minority and majority points to create
new samples close to the boundaries of the two classes. Synthetic oversampling can also
be combined with undersampling for improved performance (Ramentol et al., 2012).


Another stream of literature focuses on cost-sensitive methods, which modify the algorithm,
rather than the dataset, by assigning imbalanced costs to misclassification (Elkan,
2001; Ting, 2002; Masnadi-Shirazi and Vasconcelos, 2010). For example, in the cost-sensitive
support vector machine, the constraints in the optimization problem are such that the
cost associated with labeling a minority data point in training as majority is much higher
than that of labeling a majority data point as minority. Some methods combine cost-sensitive
learning with over/undersampling (Zhou and Liu, 2006), or employ cost-sensitive boosting
(Sun et al., 2007).


We note that cost-sensitive methods alter the algorithm, whereas synthetic oversampling
methods change the dataset; however, we can have a synthetic data generation mechanism
that works on an algorithmic level: if we generate data such that the data generation
mechanism is embedded in the algorithm, as opposed to having an independent data generator
and a classifier, we can obtain an algorithm that generates synthetic data, or absent
data in this context, to better identify the boundary (Pourhabib et al., 2015).


Section 3 presents how we employ the idea of absent data generation embedded in a
similarity-based algorithm. Our contribution is to demonstrate that absent data generation
is not confined to the formulation of kernel Fisher discriminant analysis (KFDA) (Mika
et al., 1999). To wit, the algorithm in (Pourhabib et al., 2015) utilizes absent data when
the base classifier is KFDA. So, if for a specific data structure KFDA performs very poorly,
generating absent data may only marginally improve the classification performance. By
extending the application of the idea of absent data beyond KFDA, we demonstrate the
versatility of absent data generation for imbalanced classification on a larger class of datasets.


3 Similarity-based Imbalanced Classification 


Supervised learning refers to identifying a behavior in a system, manifested through a function,
by empirical means. Supervised learning methods endeavor to generalize based on
the information embedded in data, i.e., they employ inductive reasoning (de Mantaras and
Armengol, 1998) and then they establish rules which can be utilized to characterize the system,
and predict its behavior. At the heart of this generalization is the notion of empirical
similarity (hereafter, similarity): examine the historical data for similarities between current
and previous settings and use the similarities to predict the system's behavior (Gilboa
et al., 2011).


Assume a training dataset $D = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$, where $\mathbf{x}_i \in \mathbb{R}^p$ is an
input, or the system's setting as discussed above, and $y_i \in \mathbb{R}$ is the system's response,
or behavior, for $i = 1, \ldots, n$. That is, we have $n$ input-output observations based on
which one can make a generalization about the system's behavior. We know there exists
a function $f$ such that $y_i = f(\mathbf{x}_i)$, for $i = 1, \ldots, n$. The objective is to determine this
function based on the information in $D$ in order to predict the system's behavior at an
unseen location $\mathbf{x}_t$, i.e. $y_t = f(\mathbf{x}_t)$. Denote this predicted value as $\hat{y}_t$. Assume a function
$S : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, where $S(\mathbf{x}, \mathbf{x}')$ measures the similarity of $\mathbf{x}$ to $\mathbf{x}'$. A straightforward
application of similarity-based reasoning suggests

$$\hat{y}_t = \frac{\sum_{i=1}^{n} S(\mathbf{x}_i, \mathbf{x}_t)\, y_i}{\sum_{i=1}^{n} S(\mathbf{x}_i, \mathbf{x}_t)}, \tag{1}$$

which means the predicted value $\hat{y}_t$ is a weighted average of the observations in $D$ where
the weights are the similarity measured by $S(\mathbf{x}_i, \mathbf{x}_t)$ for $i = 1, \ldots, n$. The idea of
similarity-based prediction is related to some other statistical predictive models such as kernel
regression, Bayesian updating, and interpolation (see (Gilboa et al., 2006) for a discussion).
The expression in equation (1) presents the prediction approach intuitively. Note that
equation (1), which is in a very general form, needs to be tailored to fit imbalanced
classification, the focus of this paper. The following sub-sections discuss a form for the similarity
function $S(\mathbf{x}, \mathbf{x}')$ and a synthetic data generation for similarity-based classification.


3.1 Similarity-based classification 

We focus on a two-class classification where the function value $y = f(\mathbf{x})$ has only two
values, or labels, 0 or 1. Recall that $D = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\} \subset \mathcal{X} \times \mathcal{Y}$ denotes
the training dataset, where $\mathcal{X}$ is the input domain, and $\mathcal{Y}$ is the output domain. In a
two-class classification, we can partition the set $D$ into $D^-$ and $D^+$ such that $D^- \subset D$
contains only the data points labeled 0, and $D^+ \subset D$ contains only the data points
labeled 1. Obviously, $D^- \cup D^+ = D$ and $D^- \cap D^+ = \emptyset$. Without loss of generality, assume
the data points are indexed so that the first $n^-$ data points belong to $D^-$ and the remaining
$n^+ = n - n^-$ belong to $D^+$. Specifically, $D^- = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_{n^-}, y_{n^-})\}$, where
$y_i = 0$ for $i = 1, \ldots, n^-$, and $D^+ = \{(\mathbf{x}_{n^-+1}, y_{n^-+1}), (\mathbf{x}_{n^-+2}, y_{n^-+2}), \ldots, (\mathbf{x}_n, y_n)\}$, where
$y_i = 1$ for $i = n^- + 1, \ldots, n$. When we want to emphasize that an input belongs to $D^-$
we denote that by $\mathbf{x}_i^-$, for $i = 1, \ldots, n^-$, and similarly if $(\mathbf{x}_i, y_i) \in D^+$ we may denote
the input by $\mathbf{x}_i^+$, for $i = n^- + 1, n^- + 2, \ldots, n$. We follow the convention that names one
dataset, $D^-$, negative and the other, $D^+$, positive. However, we label the data points 0
for the former, and 1 for the latter, which facilitates further probabilistic formulations.


Next, define a similarity function. Following (Gilboa et al., 2006), parametrize the
similarity function with a vector $\mathbf{w} \in \mathbb{R}_{+}^{p}$. The role of $\mathbf{w}$ is to define a weighted distance
between $\mathbf{x} = [x_1, \ldots, x_p]^T$ and $\mathbf{x}' = [x'_1, \ldots, x'_p]^T \in \mathbb{R}^p$, specifically,

$$d_{\mathbf{w}}(\mathbf{x}, \mathbf{x}') = \sqrt{\sum_{j=1}^{p} w_j d_j^2}, \tag{2}$$

where $\mathbf{w} = [w_1, \ldots, w_p]^T$, and $d_j = x_j - x'_j$ for $j = 1, \ldots, p$. Then define the similarity
function

$$S_{\mathbf{w}}(\mathbf{x}, \mathbf{x}') = \exp\{-d_{\mathbf{w}}(\mathbf{x}, \mathbf{x}')\}, \tag{3}$$

where $d_{\mathbf{w}}$ is defined in (2). In fact, $S_{\mathbf{w}}$ assigns a higher similarity measure to the points
that have a smaller weighted distance with each other and a lower similarity measure to
the points that are not close to each other based on the metric $d_{\mathbf{w}}$. Intuitively, a large value
for a component $w_j$ implies that the corresponding input dimension $x_j$ contributes more
to the value of the similarity function. Specifically, if $w_j > w_k$, a unit of increase in the
direction of the $j$th component (i.e. changing $d_j$ to $d_j + 1$) will reduce $S_{\mathbf{w}}$ more, compared
to that for the $k$th component, assuming $d_j = d_k$. While other similarity functions can also
be employed in this framework, formulas (2) and (3) are equivalent to a set of consistency
conditions on the response $y$ (see (Billot et al., 2008) for details of this axiomatization).
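
As a minimal sketch of the weighted distance and the similarity function, assuming the distance is the square root of the $w$-weighted sum of squared coordinate differences and the similarity is the exponential of its negative (the form used by Gilboa et al., 2006), the following Python snippet evaluates both on a toy input. The function names and the numeric weights are hypothetical and chosen only for illustration.

```python
import math


def weighted_distance(w, x, x_prime):
    """Square root of the w-weighted sum of squared coordinate differences."""
    return math.sqrt(sum(wj * (a - b) ** 2 for wj, a, b in zip(w, x, x_prime)))


def similarity(w, x, x_prime):
    """Exponential of the negative weighted distance; equals 1 for identical inputs."""
    return math.exp(-weighted_distance(w, x, x_prime))


# Hypothetical weights: dimension 0 is weighted more heavily than dimension 1.
w = [4.0, 1.0]
origin = [0.0, 0.0]

# A unit move along the heavily weighted dimension reduces similarity more.
s_dim0 = similarity(w, origin, [1.0, 0.0])  # distance sqrt(4) = 2
s_dim1 = similarity(w, origin, [0.0, 1.0])  # distance sqrt(1) = 1
print(s_dim0 < s_dim1)
```

The comparison illustrates the remark about $w_j > w_k$: with equal coordinate differences, the dimension with the larger weight lowers the similarity more.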

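Next, write the weighted average of the data points based on the similarity function
$S_{\mathbf{w}}$,

$$z_i = \frac{\sum_{\ell \neq i} S_{\mathbf{w}}(\mathbf{x}_i, \mathbf{x}_\ell)\, y_\ell}{\sum_{\ell \neq i} S_{\mathbf{w}}(\mathbf{x}_i, \mathbf{x}_\ell)}, \tag{4}$$

which is always between 0 and 1. A more general form which allows for a more complicated
relationship between $z_i$ and $P(y_i = 1 \mid \mathbf{x}_i, D \setminus (\mathbf{x}_i, y_i))$ can also be used, such as any cumulative
distribution function (CDF) whose support is the set $[0, 1]$ to relate the probability of
$y_i = 1$ to $z_i$, i.e.,

$$P(y_i = 1 \mid \mathbf{x}_i, D \setminus (\mathbf{x}_i, y_i)) = F(z_i), \tag{5}$$

where $F(z_i)$ denotes the value of the CDF, $F$, evaluated at $z_i$. Note that since $F$ is
non-decreasing, a higher value for $z_i$ shows a higher probability for $\mathbf{x}_i$ belonging to the positive
class. To find optimal values for $\mathbf{w}$, maximize the log-likelihood function

$$l = \sum_{i=1}^{n} \left\{ y_i \ln F(z_i) + (1 - y_i) \ln(1 - F(z_i)) \right\}. \tag{6}$$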
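
The following Python sketch illustrates the leave-one-out weighted average $z_i$ and the resulting log-likelihood on a toy dataset. It is an illustration under stated assumptions rather than the authors' implementation: the similarity is taken to be the exponential of the negative weighted distance, the CDF defaults to the uniform CDF on $[0, 1]$, and the clipping constant `eps` (a hypothetical safeguard not discussed in the text) avoids taking the logarithm of zero.

```python
import math


def similarity(w, x, x_prime):
    """Assumed form: exp of the negative w-weighted Euclidean-type distance."""
    d = math.sqrt(sum(wj * (a - b) ** 2 for wj, a, b in zip(w, x, x_prime)))
    return math.exp(-d)


def leave_one_out_scores(w, X, y):
    """z_i: similarity-weighted average of the labels of all points except point i."""
    scores = []
    for i, xi in enumerate(X):
        num = sum(similarity(w, xi, xl) * yl
                  for l, (xl, yl) in enumerate(zip(X, y)) if l != i)
        den = sum(similarity(w, xi, xl) for l, xl in enumerate(X) if l != i)
        scores.append(num / den)
    return scores


def log_likelihood(w, X, y, cdf=lambda z: z, eps=1e-12):
    """Bernoulli log-likelihood of the labels given F(z_i); eps clips F away from 0 and 1."""
    total = 0.0
    for zi, yi in zip(leave_one_out_scores(w, X, y), y):
        p = min(max(cdf(zi), eps), 1.0 - eps)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total


# Toy data: two negative points near the origin, two positive points near (1, 1).
X = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]]
y = [0, 0, 1, 1]

print(leave_one_out_scores([1.0, 1.0], X, y))
print(log_likelihood([1.0, 1.0], X, y), log_likelihood([10.0, 10.0], X, y))
```

On this well-separated toy dataset, larger weights sharpen the similarity and raise the log-likelihood, which is the quantity the weights are tuned to maximize.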


Recall that if the majority of data points in $D$ belong to the negative class, i.e., if $n^+ \ll n^-$,
classification algorithms generally label many of the test points belonging to the positive
class incorrectly, i.e. negative (He and Garcia, 2009). Note, too, that the optimized
similarity function will be biased towards labeling most test points as negative, even though
they may belong to the positive class, if the dataset is overwhelmed by one class. For some
data structures, the poor performance of classification techniques can be attributed to
insufficient information as a result of too few data points in $D^+$. Note that "creating" extra
synthetic data points using the current dataset may improve algorithmic performance, but
doing so may introduce bias. The next sub-section explains how absent data generation
may provide an acceptable balance between the expected classification error and the bias.


3.2 Absent data generation 

As mentioned in Section 1, for many imbalanced classifications, the crucial property of
an algorithm is its ability to correctly identify the test points belonging to the minority
class. Therefore, balancing the dataset by generating the synthetic data points belonging
to the minority class enhances the algorithm's detection power (Chawla et al., 2002). Most
synthetic data generation methods have two independent algorithms: one that creates synthetic
data, and one that performs classification on the new dataset consisting of the actual
and synthetic data points. If the synthetic data generation mechanism is embedded within
the classification algorithm, the new data points are generated such that the performance
of the classifier improves compared to having a data generation algorithm independent
from the classification algorithm (Pourhabib et al., 2015). Hence, the idea of absent data
comes into play, i.e., the data points that, if they existed, would help the classification
algorithm better identify the test points belonging to the minority class. Absent data can
be considered as a special case of synthetic data whose properties can be used to improve
an algorithm's detection power without needing to generate the synthetic data.

Let $\mathbf{x}_t^a \in \mathcal{X}^+$, for $t = 1, \ldots, T$, denote absent data points, where $\mathcal{X}^+ \subset \mathcal{X}$ represents the
input domain for minority inputs, and use the points to construct constraints that mitigate
the low detection power problem. Since the absent data points compensate for the lack of a
sufficient number of minority points, they need to belong to the minority domain $\mathcal{X}^+$. To
ensure each $\mathbf{x}_t^a$ belongs to $\mathcal{X}^+$, restrict the absent data points to be "close" to the existing
minority points in $D^+$. To define closeness, employ the similarity function to make sure the
absent points are similar, as determined by the function $S_{\mathbf{w}}(\cdot, \cdot)$, to the existing minority points,
i.e.,

$$\sum_{t=1}^{T} \sum_{i=n^-+1}^{n} S_{\mathbf{w}}(\mathbf{x}_i^+, \mathbf{x}_t^a) \geq \Delta, \tag{7}$$

for some $\Delta > 0$. Constraint (7) states that the overall similarity of all of the absent data
points to the existing minority data points should exceed some threshold.

Absent data being similar to the existing minority data, however, does not guarantee
their usefulness. In other words, the synthetic data are useful as long as they are close to
the boundary (Han et al., 2005), so the absent data must not be far away from the existing
majority points, specifically,

$$\sum_{t=1}^{T} \sum_{i=1}^{n^-} S_{\mathbf{w}}(\mathbf{x}_i^-, \mathbf{x}_t^a) \geq \delta, \tag{8}$$

for some $\delta > 0$. Constraint (8) may appear counter-intuitive, but recalling the role of absent
data, which is to facilitate the correct boundary identification, leads to the realization
that the data points residing far from the boundary between the two classes will not be
informative. In fact, constraints (7) and (8) together enforce that the absent data points
fall in a region separating the two classes. It is preferable to use constraints (7) and (8) to
address the overall similarity between all absent data points and existing majority/minority
data points rather than enforcing similarity between all individual points, because the latter
approach makes likelihood maximization very challenging due to the resulting large number
of constraints.
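
To make the role of the two aggregate similarity constraints concrete, the sketch below evaluates both sums for two hypothetical candidate absent points: one lying between the classes and one far beyond the minority region. The similarity form (exponential of the negative weighted distance), the toy coordinates, and the threshold values are assumptions made for illustration only.

```python
import math


def similarity(w, x, x_prime):
    """Assumed form: exp of the negative w-weighted Euclidean-type distance."""
    d = math.sqrt(sum(wj * (a - b) ** 2 for wj, a, b in zip(w, x, x_prime)))
    return math.exp(-d)


def constraint_sums(w, absent, minority, majority):
    """Total similarity of the absent points to the minority and to the majority points."""
    to_minority = sum(similarity(w, xi, xa) for xa in absent for xi in minority)
    to_majority = sum(similarity(w, xi, xa) for xa in absent for xi in majority)
    return to_minority, to_majority


w = [1.0, 1.0]
majority = [[0.0, 0.0], [0.5, 0.0]]
minority = [[2.0, 2.0]]
threshold_minority = 0.3  # hypothetical threshold on similarity to the minority
threshold_majority = 0.3  # hypothetical threshold on similarity to the majority

near_boundary = constraint_sums(w, [[1.2, 1.2]], minority, majority)
far_away = constraint_sums(w, [[4.0, 4.0]], minority, majority)

print(near_boundary, far_away)
```

The candidate between the classes satisfies both thresholds, whereas the distant candidate fails them, which mirrors the argument that informative absent points fall in the region separating the two classes.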

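Therefore, maximize the log-likelihood (6) subject to constraints (7) and (8),

$$\max \; l = \sum_{i=1}^{n} \left\{ y_i \ln F(z_i) + (1 - y_i) \ln(1 - F(z_i)) \right\}, \tag{9}$$

$$\text{s.t.} \quad \sum_{t=1}^{T} \sum_{i=n^-+1}^{n} S_{\mathbf{w}}(\mathbf{x}_i^+, \mathbf{x}_t^a) \geq \Delta, \qquad \sum_{t=1}^{T} \sum_{i=1}^{n^-} S_{\mathbf{w}}(\mathbf{x}_i^-, \mathbf{x}_t^a) \geq \delta, \tag{10}$$

for given $\delta > 0$ and $\Delta > 0$, where the decision variables are $\mathbf{w}$ and $\mathbf{x}_t^a$, $t = 1, \ldots, T$. To
solve optimization problem (9)-(10), write the Lagrangian of the problem as

$$\max \; g(\mathbf{w}, \mathbf{X}^a) = l + \lambda_1 \Big( \sum_{t=1}^{T} \sum_{i=n^-+1}^{n} S_{\mathbf{w}}(\mathbf{x}_i^+, \mathbf{x}_t^a) - \Delta \Big) + \lambda_2 \Big( \sum_{t=1}^{T} \sum_{i=1}^{n^-} S_{\mathbf{w}}(\mathbf{x}_i^-, \mathbf{x}_t^a) - \delta \Big), \tag{11}$$

where $\lambda_1 \geq 0$ and $\lambda_2 \geq 0$ are the Lagrangian coefficients, and $\mathbf{X}^a$ is a $p \times T$ matrix whose
$t$th column is $\mathbf{x}_t^a$. The log-likelihood $l$ was defined in (6).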

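It is possible to interpret optimization problem (11) as a penalized log-likelihood
maximization, specifically, finding the weights $\mathbf{w}$ that maximize the likelihood while penalizing any
violation of the constraints related to the absent data. Assume the Lagrangian coefficients
are given and find the stationary points of the objective function $g(\mathbf{w}, \mathbf{X}^a)$ in (11),

$$\frac{\partial g}{\partial w_j} = \sum_{i=1}^{n} \frac{(y_i - F(z_i))\, f(z_i)}{F(z_i)(1 - F(z_i))} \frac{\partial z_i}{\partial w_j} + \lambda_1 \sum_{t=1}^{T} \sum_{i=n^-+1}^{n} \frac{\partial}{\partial w_j} S_{\mathbf{w}}(\mathbf{x}_i^+, \mathbf{x}_t^a) + \lambda_2 \sum_{t=1}^{T} \sum_{i=1}^{n^-} \frac{\partial}{\partial w_j} S_{\mathbf{w}}(\mathbf{x}_i^-, \mathbf{x}_t^a), \tag{12}$$

for $j = 1, \ldots, p$, where $f$ is the probability density function associated with $F$. Note that

$$\frac{\partial z_i}{\partial w_j} = \frac{A \sum_{\ell \neq i} S_{\mathbf{w},j}(\mathbf{x}_i, \mathbf{x}_\ell)\, y_\ell - AY \sum_{\ell \neq i} S_{\mathbf{w},j}(\mathbf{x}_i, \mathbf{x}_\ell)}{A^2}, \tag{13}$$

where

$$A = \sum_{\ell \neq i} S_{\mathbf{w}}(\mathbf{x}_i, \mathbf{x}_\ell), \qquad AY = \sum_{\ell \neq i} S_{\mathbf{w}}(\mathbf{x}_i, \mathbf{x}_\ell)\, y_\ell,$$

$$S_{\mathbf{w},j}(\mathbf{x}_i, \mathbf{x}_\ell) = \frac{\partial S_{\mathbf{w}}(\mathbf{x}_i, \mathbf{x}_\ell)}{\partial w_j} = -\frac{S_{\mathbf{w}}(\mathbf{x}_i, \mathbf{x}_\ell)\,(x_{ij} - x_{\ell j})^2}{2\, d_{\mathbf{w}}(\mathbf{x}_i, \mathbf{x}_\ell)}. \tag{14}$$

The partial derivatives of $g$ with respect to the absent data points are

$$\frac{\partial g}{\partial x_{tj}^a} = \lambda_1 \sum_{i=n^-+1}^{n} \frac{S_{\mathbf{w}}(\mathbf{x}_i^+, \mathbf{x}_t^a)\, w_j\, (x_{ij} - x_{tj}^a)}{d_{\mathbf{w}}(\mathbf{x}_i^+, \mathbf{x}_t^a)} + \lambda_2 \sum_{i=1}^{n^-} \frac{S_{\mathbf{w}}(\mathbf{x}_i^-, \mathbf{x}_t^a)\, w_j\, (x_{ij} - x_{tj}^a)}{d_{\mathbf{w}}(\mathbf{x}_i^-, \mathbf{x}_t^a)}, \tag{15}$$

for $j = 1, \ldots, p$, and $t = 1, \ldots, T$, where $\mathbf{x}_i = [x_{i1}, \ldots, x_{ip}]^T$, and $\mathbf{x}_t^a = [x_{t1}^a, \ldots, x_{tp}^a]^T$.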
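
As a sanity check on the derivative with respect to an absent data point, the sketch below compares an analytic gradient against central finite differences. It assumes the similarity is the exponential of the negative weighted distance, so that the derivative of one similarity term with respect to coordinate $j$ of the absent point is the similarity times $w_j$ times the coordinate difference divided by the distance. Only the penalty part of the Lagrangian is included, since the log-likelihood and the thresholds do not depend on the absent points. All names and numeric values are hypothetical.

```python
import math


def dist(w, x, x_prime):
    return math.sqrt(sum(wj * (a - b) ** 2 for wj, a, b in zip(w, x, x_prime)))


def sim(w, x, x_prime):
    return math.exp(-dist(w, x, x_prime))


def penalty(w, xa, minority, majority, lam1, lam2):
    """Part of the Lagrangian that depends on a single absent point xa."""
    return (lam1 * sum(sim(w, xi, xa) for xi in minority)
            + lam2 * sum(sim(w, xi, xa) for xi in majority))


def analytic_grad(w, xa, minority, majority, lam1, lam2):
    """Closed-form gradient; requires xa to differ from every data point (nonzero distance)."""
    def term(xi, j):
        return sim(w, xi, xa) * w[j] * (xi[j] - xa[j]) / dist(w, xi, xa)
    return [lam1 * sum(term(xi, j) for xi in minority)
            + lam2 * sum(term(xi, j) for xi in majority)
            for j in range(len(xa))]


def numeric_grad(w, xa, minority, majority, lam1, lam2, h=1e-6):
    """Central finite-difference approximation of the same gradient."""
    grad = []
    for j in range(len(xa)):
        up, down = list(xa), list(xa)
        up[j] += h
        down[j] -= h
        grad.append((penalty(w, up, minority, majority, lam1, lam2)
                     - penalty(w, down, minority, majority, lam1, lam2)) / (2 * h))
    return grad


w = [1.0, 2.0]
minority = [[2.0, 2.0], [2.5, 1.8]]
majority = [[0.0, 0.0], [0.5, -0.2], [-0.3, 0.4]]
xa = [1.2, 1.1]

analytic = analytic_grad(w, xa, minority, majority, 1.0, 0.5)
numeric = numeric_grad(w, xa, minority, majority, 1.0, 0.5)
max_gap = max(abs(a - b) for a, b in zip(analytic, numeric))
print(max_gap)
```

Agreement between the two gradients supports the sign convention in which each term pulls the absent point toward the data point it is compared with.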

Solve the total of $(T + 1)p$ equations

$$\frac{\partial g}{\partial x_{tj}^a} = 0, \quad \text{for } j = 1, \ldots, p, \; t = 1, \ldots, T, \tag{16}$$

$$\frac{\partial g}{\partial w_j} = 0, \quad \text{for } j = 1, \ldots, p, \tag{17}$$

using iterative numerical techniques, such as a trust region algorithm (Byrd et al., 2000;
Conn et al., 2000), to minimize the sum of squares of the left-hand sides of (16) and (17), which can be conducted
in polynomial time. Since the solutions to equations (16)-(17) are the points that satisfy the
first-order necessary conditions, which due to the non-convexity of (11) are not necessarily
the global optimal points of optimization problem (11), note that the proposed algorithm
may become trapped in local optima for some datasets.

Last, we need to determine the values of the Lagrangian coefficients $\boldsymbol{\lambda} = [\lambda_1, \lambda_2]^T$. The
Lagrangian relaxation provides an upper bound for the original problem. To obtain the
solution of (9)-(10), minimize the maximum value of the Lagrangian relaxation. Specifically,

$$\min_{\boldsymbol{\lambda}} R(\boldsymbol{\lambda}) \quad \text{s.t. } \boldsymbol{\lambda} \geq 0, \tag{18}$$

where $R(\boldsymbol{\lambda})$ is the value of the objective function in (11), i.e., for a given $\boldsymbol{\lambda}$, if $(\mathbf{w}, \mathbf{X}^a)$ is a
solution to (16) and (17), $R(\boldsymbol{\lambda}) = g(\mathbf{w}, \mathbf{X}^a)$. Section 4.2 provides a discretization scheme
to approximate $R(\boldsymbol{\lambda})$, because solving (18) to optimality is challenging.

3.3 Cluster-based undersampling

Combining undersampling of the majority data points with oversampling (synthetic or
actual) of the minority data points (Chawla et al., 2002) helps to identify the correct
boundary in imbalanced data structures. Efficiency is another reason to conduct undersampling,
since a large number of majority points slows the iterative procedure for solving
equations (16) and (17).

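Let $D_{\mathbf{x}}^-$ denote the inputs in the training dataset containing the majority points. That
is, $D_{\mathbf{x}}^- = \{\mathbf{x}_1^-, \mathbf{x}_2^-, \ldots, \mathbf{x}_{n^-}^-\}$ such that $(\mathbf{x}_i, y_i) \in D^-$, for $i = 1, \ldots, n^-$. Cluster $D_{\mathbf{x}}^-$ into
$K \in \mathbb{N}$ clusters, $\{C_1, \ldots, C_K\}$, where $C_i \cap C_k = \emptyset$, for $i \neq k$, and $\bigcup_i C_i = D_{\mathbf{x}}^-$. Then,
for every $\mathbf{x}_i \in D_{\mathbf{x}}^-$, there exists one (and only one) $C_k$ such that $\mathbf{x}_i \in C_k$. Create $U \in \mathbb{N}$
undersampled majority training datasets $\{D_1^-, \ldots, D_U^-\}$ such that every $D_\ell^-$, $\ell = 1, \ldots, U$,
contains $K$ majority training data points. Specifically,

$$D_\ell^- = \{(\mathbf{x}_{\ell_1}, y_{\ell_1}), \ldots, (\mathbf{x}_{\ell_K}, y_{\ell_K})\}, \tag{19}$$

where $\mathbf{x}_{\ell_j} \in C_j$, for $j = 1, \ldots, K$. In other words, each $D_\ell^-$ contains $K$ data points, where
the input for each data point comes from one of the $K$ clusters $\{C_1, \ldots, C_K\}$. To create
each $D_\ell^-$, perform random undersampling.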
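
The undersampling step can be sketched as follows, assuming a clustering of the majority inputs (for example by k-means) has already produced one cluster label per point. The function name `undersample_by_cluster`, the seed argument, and the toy coordinates are hypothetical; the sketch only shows the "one random point per cluster, repeated for each undersampled dataset" rule described above.

```python
import random


def undersample_by_cluster(majority, cluster_ids, n_datasets, seed=0):
    """Build n_datasets subsets, each holding one randomly drawn point per cluster."""
    rng = random.Random(seed)
    clusters = {}
    for point, cid in zip(majority, cluster_ids):
        clusters.setdefault(cid, []).append(point)
    # Fix the cluster order so position j in every subset corresponds to cluster j.
    ordered = [clusters[cid] for cid in sorted(clusters)]
    return [[rng.choice(members) for members in ordered] for _ in range(n_datasets)]


# Toy majority inputs with precomputed (hypothetical) cluster labels.
majority = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.2, 4.9), (9.0, 0.0)]
cluster_ids = [0, 0, 1, 1, 2]

subsets = undersample_by_cluster(majority, cluster_ids, n_datasets=3)
for subset in subsets:
    print(subset)
```

Each subset has as many points as there are clusters, so the majority class is represented by a small set that still covers every region identified by the clustering.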

Then train the model $U$ times based on the undersampled datasets $D_\ell := D_\ell^- \cup D^+$ for
$\ell = 1, \ldots, U$; specifically, use $D_\ell$ to solve (16) and (17). Each of these trainings provides
an estimate for the probability of the label of a test point being one, i.e. $P_\ell(\mathbf{x}_*) = P(y_* = 1 \mid D_\ell, \mathbf{x}_*)$,
where $\mathbf{x}_* \in \mathcal{X}$ is a test point. The sample average of all estimates serves as the
prediction of the probability $P(y_* = 1 \mid \mathbf{x}_*)$. Such ensemble learning (Hastie et al., 2009)
based on undersampled majority data points has proven powerful in handling imbalanced
data structures (Liu et al., 2009). We use a $k$-means algorithm to cluster the majority
inputs, which can be implemented in a close to linear time complexity (Kanungo et al.,
2002). Section 4.2 presents guidelines for selecting the number of clusters for each dataset.
Based on the proposed framework, the algorithmic steps are as follows (Algorithm 1 lists the
steps).

Let $(\mathbf{w}_\ell, \mathbf{X}_\ell^a)$ denote the stationary points for $g(\mathbf{w}, \mathbf{X}^a)$ based on the training data in
$D_\ell$, i.e., instead of using all the points in $D$, use the smaller set $D_\ell$ to solve (16) and (17). Use
$\mathbf{w}_\ell$ to calculate $z_i$ in (4), and also use it to calculate $P(y_* = 1 \mid \mathbf{x}_*, D_\ell)$, i.e., the probability
of a test point $\mathbf{x}_*$ belonging to the minority class based on the dataset $D_\ell$. Find the
probability $P(y_* = 1 \mid \mathbf{x}_*)$ by averaging over all the predicted probabilities, specifically,

$$P(y_* = 1 \mid \mathbf{x}_*) = \sum_{\ell=1}^{U} \pi(D_\ell)\, P(y_* = 1 \mid \mathbf{x}_*, D_\ell), \tag{20}$$

where $\pi(D_\ell)$ is the prior probability associated with the dataset $D_\ell$. Assigning equal prior
probability to each dataset yields

$$P(y_* = 1 \mid \mathbf{x}_*) = \frac{1}{U} \sum_{\ell=1}^{U} P(y_* = 1 \mid \mathbf{x}_*, D_\ell). \tag{21}$$

We call the proposed approach for estimating $P(y_* = 1 \mid \mathbf{x}_*)$ Similarity-based Imbalanced
Classification (SBIC). In SBIC, although the values of absent data points that solve
equations (16) do not appear in (21), they impact the optimal values of $\mathbf{w}$, which determines
$z_i$ in equation (4). In other words, incorporating absent data into the formulation
guides the similarity weights, $\mathbf{w}$, to self-adjust, as though the absent data actually
exist. Notably, SBIC learns both absent data points and similarity weights
simultaneously, and then uses the latter for prediction.


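Algorithm 1 Similarity-based Imbalanced Classification

1. Given $D^-$, $D^+$, $K$, $U$, $T$, $F(\cdot)$, $\delta$, $\Delta$, $\mathbf{x}_* \in \mathcal{X}$.

2. Cluster $D^-$ into $K$ sets. Let $\ell = 1$, $P = 0$.

repeat

3. Create $D_\ell^-$ according to (19). Let $D_\ell = D_\ell^- \cup D^+$.

4. Let $\boldsymbol{\lambda} = \Lambda^*(\ell)$.

5. Let $(\mathbf{w}_\ell, \mathbf{X}_\ell^a)$ be a solution to (16) and (17) based on the data points in $D_\ell$.

6. Calculate $S_{\mathbf{w}_\ell}$ according to (3).

7. $z_* = \dfrac{\sum_{(\mathbf{x}_i, y_i) \in D_\ell} S_{\mathbf{w}_\ell}(\mathbf{x}_i, \mathbf{x}_*)\, y_i}{\sum_{(\mathbf{x}_i, y_i) \in D_\ell} S_{\mathbf{w}_\ell}(\mathbf{x}_i, \mathbf{x}_*)}$.

8. $P = P + P(y_* = 1 \mid \mathbf{x}_*, D_\ell)$, where $P(y_* = 1 \mid \mathbf{x}_*, D_\ell) = F(z_*)$.

9. $\ell = \ell + 1$.

until $\ell > U$.

10. $P(y_* = 1 \mid \mathbf{x}_*) = P / U$.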


Algorithm 1 assumes the values of the Lagrangian coefficients $\boldsymbol{\lambda} = [\lambda_1, \lambda_2]^T$ are given.
Section 4.2 discusses how we obtain the Lagrangian coefficients. In the algorithm, $\Lambda^*$ stores
all the values of $\boldsymbol{\lambda}$ for any $D_\ell$, and the algorithm picks the associated value by assigning
$\Lambda^*(\ell)$ to $\boldsymbol{\lambda}$.
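
The prediction part of the procedure can be sketched as below, assuming the weight vector for each undersampled dataset has already been obtained from the optimization step (which is not reproduced here). The similarity is assumed to be the exponential of the negative weighted distance, the CDF defaults to the uniform CDF on $[0, 1]$, and all data, weights, and names such as `sbic_predict` are hypothetical.

```python
import math


def similarity(w, x, x_prime):
    """Assumed form: exp of the negative w-weighted Euclidean-type distance."""
    d = math.sqrt(sum(wj * (a - b) ** 2 for wj, a, b in zip(w, x, x_prime)))
    return math.exp(-d)


def sbic_predict(x_star, datasets, weights, cdf=lambda z: z):
    """Average, over the undersampled datasets, of F applied to the weighted label average."""
    total = 0.0
    for (inputs, labels), w in zip(datasets, weights):
        sims = [similarity(w, xi, x_star) for xi in inputs]
        z_star = sum(s * yi for s, yi in zip(sims, labels)) / sum(sims)
        total += cdf(z_star)
    return total / len(datasets)


# Two hypothetical undersampled datasets sharing the same minority (label 1) points.
positives = [[2.0, 2.0], [2.2, 1.9]]
datasets = [
    ([[0.0, 0.0], [0.4, 0.1]] + positives, [0, 0, 1, 1]),
    ([[0.1, 0.3], [0.5, -0.2]] + positives, [0, 0, 1, 1]),
]
# Hypothetical weights standing in for the per-dataset optimization results.
weights = [[1.0, 1.0], [1.5, 0.5]]

p_near_minority = sbic_predict([1.9, 2.1], datasets, weights)
p_near_majority = sbic_predict([0.2, 0.1], datasets, weights)
print(p_near_minority, p_near_majority)
```

The absent data points do not appear in this step; they influence the prediction only through the weights they helped to shape during training.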


4 Numerical studies 


Comparing the performance of different algorithms on imbalanced datasets requires care. 
Section 4.1 gives the details, Section 4.2 discusses the selection of parameters assumed
given for Algorithm 1, and Section 4.3 compares SBIC with competing algorithms.

4.1 Evaluation Criteria

In general, for a two-class classification, an algorithm is deemed effective if it can correctly
label test points as positive or negative. If $D_* = \{(\mathbf{x}_{*1}, y_{*1}), (\mathbf{x}_{*2}, y_{*2}), \ldots, (\mathbf{x}_{*n_{te}}, y_{*n_{te}})\}$
denotes the test dataset, and $\hat{y}_i$ denotes the predicted label for the test input $i = 1, \ldots, n_{te}$,
then measure

$$\frac{1}{n_{te}} \sum_{i=1}^{n_{te}} I(y_{*i} = \hat{y}_i), \tag{22}$$

where $I(\cdot)$ is an indicator function which returns 1 if its argument is true and 0 otherwise. However, in most
classification applications, particularly for imbalanced classifications, the cost associated
with incorrectly labeling the positive points as negative is much higher than the opposite.
Therefore, it is important to distinguish between the two types of error, false alarm and 
mis-detection. Specifically, let $D_*^+ \subset D_*$ and $D_*^- \subset D_*$ denote the subsets of the test
dataset that contain the positive and negative labels, respectively. Define false alarm

$$FA = \frac{1}{|D_*^-|} \sum_{i} I(y_{*i} \neq \hat{y}_i), \quad \text{for } (\mathbf{x}_{*i}, y_{*i}) \in D_*^-, \tag{23}$$

where $|D_*^-|$ denotes the number of negative points in the test dataset. Now define
mis-detection

$$MD = \frac{1}{|D_*^+|} \sum_{i} I(y_{*i} \neq \hat{y}_i), \quad \text{for } (\mathbf{x}_{*i}, y_{*i}) \in D_*^+. \tag{24}$$

Ideally, $FA = MD = 0$, but this does not happen except for trivial cases. Also note that SBIC
is a probabilistic classifier. That is, SBIC does not directly predict positive or negative
labels for a given test point $\mathbf{x}_*$, but it does provide a probability $P(y_* = 1 \mid \mathbf{x}_*)$, also called
a score, as noted in equation (21). Therefore, assign a label to $\mathbf{x}_*$ by defining a decision
threshold between 0 and 1, where a test point with $P(y_* = 1 \mid \mathbf{x}_*)$ less than or equal to the
threshold is labeled negative. Changing the decision threshold can give different values for
FA and MD. A trade-off between false alarm and mis-detection implies that reducing the
threshold increases false alarms and decreases mis-detections.
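
The two error rates and their dependence on the decision threshold can be illustrated with a short sketch. The scores and labels below are made-up values, and the function name is hypothetical; a point is labeled negative when its score is less than or equal to the threshold, as described above.

```python
def false_alarm_and_misdetection(scores, labels, threshold):
    """Return (FA, MD) when points with score <= threshold are labeled negative."""
    predictions = [1 if s > threshold else 0 for s in scores]
    negatives = [p for p, y in zip(predictions, labels) if y == 0]
    positives = [p for p, y in zip(predictions, labels) if y == 1]
    fa = sum(1 for p in negatives if p == 1) / len(negatives)
    md = sum(1 for p in positives if p == 0) / len(positives)
    return fa, md


# Made-up scores for three negative (0) and three positive (1) test points.
scores = [0.10, 0.40, 0.35, 0.80, 0.70, 0.45]
labels = [0, 0, 0, 1, 1, 1]

fa_high, md_high = false_alarm_and_misdetection(scores, labels, threshold=0.5)
fa_low, md_low = false_alarm_and_misdetection(scores, labels, threshold=0.3)
print((fa_high, md_high), (fa_low, md_low))
```

Lowering the threshold from 0.5 to 0.3 removes the mis-detection at the price of two false alarms, which is the trade-off described in the text.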

The receiver operating characteristic (ROC) curve formalizes the idea of evaluating
a probabilistic classifier by changing the decision threshold (Bradley, 1997). The details
are as follows. In an ROC curve, the x-axis denotes the false alarm and the y-axis denotes
$1 - MD$, also called the detection power (DP). Setting the decision threshold to 1
corresponds to the point (0, 0), i.e., a classifier with no FA and no DP. Gradually reducing
the threshold with steps smaller than the minimum value of the differences between scores
yields a point with either a higher FA or a higher DP, and continuing to do so yields points
on the ROC space, with each representing a (FA, DP) combination. Connecting all points
yields the ROC curve.
An algorithm is deemed relatively superior when its ROC curve is close to the two-segment
line from [0,0] to [0,1] and from [0,1] to [1,1] on the FA-DP axes, i.e., the curve
is closer to the top-left region of the plot (see Figure 1). To evaluate performance, simply
measure the area under the ROC curve, or AUC. An algorithm with a larger AUC, i.e.,
closer to 1, is deemed superior for a given dataset. Once an ROC curve is generated, use
numerical integration to calculate the corresponding AUC.
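
A minimal sketch of the ROC construction and the AUC computation follows. It assumes scores lie in $[0, 1]$, sweeps the threshold from 1 downward through the midpoints between consecutive distinct scores, and integrates with the trapezoidal rule; the particular choice of midpoints and the toy scores are assumptions made for illustration.

```python
def roc_points(scores, labels):
    """(FA, DP) pairs obtained by lowering the threshold from 1 past every distinct score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    distinct = sorted(set(scores), reverse=True)
    # Threshold 1 labels everything negative; the last cut labels everything positive.
    cuts = ([1.0]
            + [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
            + [distinct[-1] - 1.0])
    points = []
    for t in cuts:
        fa = sum(1 for s, y in zip(scores, labels) if y == 0 and s > t) / n_neg
        dp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > t) / n_pos
        points.append((fa, dp))
    return points


def area_under_curve(points):
    """Trapezoidal rule over consecutive ROC points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Made-up scores: one negative point (0.6) outranks one positive point (0.5).
scores = [0.1, 0.4, 0.6, 0.8, 0.7, 0.5]
labels = [0, 0, 0, 1, 1, 1]

points = roc_points(scores, labels)
auc = area_under_curve(points)
print(points)
print(auc)
```

With no tied scores, the trapezoidal area equals the fraction of positive-negative pairs ranked correctly, which here is eight of the nine pairs.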



Figure 1: The ROC curve for two algorithms. Algorithm a is superior since its associated 
area under curve (AUC) is larger than that of Algorithm b. 


4.2 Parameters of SBIC 

Algorithm 1 has a set of user-defined parameters. This section gives some guidelines for
their selection.

The number of absent data points, $T$, impacts the optimal values of the weights $\mathbf{w}$ as
well as the efficiency of the model, since the number of equations to obtain the stationary
points is $(T + 1)p$. Generating synthetic data balances the dataset, and there is no need
to generate absent data. In fact the role of absent data is to guide the weights in order
to account for the dataset's imbalanced structure. Thus, a large number of absent data
points does not necessarily improve the algorithm's prediction capability. The following
implementation uses $T = p$, i.e., the dimension of the input space $\mathcal{X}$. Based on the
experimental results, this setting provides a good balance between prediction accuracy and
efficiency.

Lagrangian coefficients A = [Ai,A 2 ]^ in 0 determine how much to penalize viola¬ 
tions of constraints Q and Q. Note that in Algorithm each undersampled data Di 
needs a value for A. Perform an exhaustive search to obtain the optimal value for A. 
Specifically, let Ai = Af 2 , • • •, Af^^} denote the set of candidate values for Ai, and 
^2 = {A 21 , A 22 , • • •, A 2 ^^} denote the set of candidate values for A 2 . Then, for a given Di 
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solve equation (16) and (© and evaluate 


T n 

i?(A5p,Ay= l + E 5w(x+x“)-A) 

t=l i=n--\-l 
T n- 

+^2g(E E X“) - 

t=l i=l 


( 25 ) 


for p^q G {1,... ^Tic}> Store the optimal values based on this approach in the array A^, 
which is used in Algorithm Note that such a discretization approach does not provide 
the optimal solution to ( [T^ ; however, for any values of (Ai, A 2 ), a solution to optimization 
problem (18) provides an upper bound for the optimization problem (© -([To|). Now utilize 
Wx, that results in an upper bound for ([^-(10) to make a prediction at G X. 

Parameters $\Delta$ and $\delta$ determine a threshold for the similarity between the
absent points and the minority or majority points, respectively. These parameters only
appear in finding the optimal values for $\lambda$. We suggest choosing them according to
equations (26) and (27).

Equation (26) implies that the average similarity between an absent data point and the
minority data points should be greater than that between the minority points themselves.
Equation (27) implies that the average similarity between an absent data point and the
majority points should be greater than 25% of the similarity between the majority points.
Both equations use a w that is a local optimum of the likelihood function.

Recall that we cluster dataset D into K clusters in order to build an ensemble learner
and to improve the efficiency of solving equation (16) and its companion equation. In other words,
K needs to be small enough to have a sufficient number of data points in each $D_\ell$ and large enough
to solve these equations efficiently. If some data points are densely aggregated in one region,
consider them as one cluster, which in turn reduces K. Therefore, the selection of K depends on
the specific dataset. In this implementation, K is selected to balance the relative size of the
majority to minority points, and K is at least 50 to maintain the effectiveness
of each $D_\ell$. K relates to U, i.e., the number of undersampled datasets. If the number of
data points in each cluster is small, a small value for U is sufficient, whereas if the number
of data points is large, a larger U is needed. In this implementation, depending on the
dataset, U ranges between 1, i.e., only one undersampled dataset for model training, and
10.

Finally, we need to determine the CDF, $F(\cdot)$. Note that $z_i$ is always
between 0 and 1, i.e., the support for the CDF should be between 0 and 1. Therefore, use
distributions such as beta or uniform. In this implementation the uniform distribution for
F, specifically, $F(z_i) = z_i \mathbb{1}(0 \le z_i \le 1)$, is selected.


4.3 Toy examples 


Before reporting the results on real datasets, we present the performance of SBIC on the
following three simulated datasets. We generate $n^- = 100$ data points from a normal
distribution with mean $[-1, -2]^T$ and variance-covariance matrix [1.1, 0.1; 0.1, 1.2], which
constitute the majority data points. For the minority data points, we create $n^+ = 20$
samples from a normal distribution with mean $[2, 1]^T$ and variance-covariance matrix
[0.6, -0.1; -0.1, 1.7]. Toy1, therefore, is the dataset with well-separated minority and
majority samples (see plot (a-1) in Figure 2). We create another set of minority samples from a normal
distribution with mean $[1, 1]^T$ and variance-covariance matrix [0.6, -0.1; -0.1, 1.7]. Toy2,
therefore, is the dataset with aggregated minority and majority samples (see plot (b-1)
in Figure 2). Note that for both Toy1 and Toy2, we undersample the majority datasets
so that we have $n^- = 50$ remaining majority points. The plots in Figure 2 only depict the 50
majority points along with the original $n^+ = 20$ minority points. Both toys use T = 2
absent data points in the implementation of SBIC.
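The construction of Toy1 and Toy2 can be sketched as follows. The means and covariance matrices are the ones stated above; the function name, the random seed, and the use of uniform random undersampling of the majority class are assumptions made only for illustration, since the text does not specify how the 50 majority points are selected.

```python
import numpy as np

def make_toy(minority_mean, n_major=100, n_minor=20, n_keep=50, seed=0):
    """Draw majority/minority samples and keep n_keep majority points."""
    rng = np.random.default_rng(seed)
    major = rng.multivariate_normal([-1.0, -2.0],
                                    [[1.1, 0.1], [0.1, 1.2]], size=n_major)
    minor = rng.multivariate_normal(minority_mean,
                                    [[0.6, -0.1], [-0.1, 1.7]], size=n_minor)
    # Assumption: uniform random undersampling of the majority class.
    keep = rng.choice(n_major, size=n_keep, replace=False)
    return major[keep], minor

toy1_major, toy1_minor = make_toy([2.0, 1.0])   # well-separated classes
toy2_major, toy2_minor = make_toy([1.0, 1.0])   # overlapping classes
print(toy1_major.shape, toy1_minor.shape)
```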

Plot (a-2) in Figure 2 shows the locations of the absent data points found by the SBIC
algorithm and the contour plots of the probabilities of belonging to the minority class.
We obtain the contour plots by creating test points close to the minority training samples
and then fitting a continuous surface to the estimated probabilities obtained through the SBIC
algorithm. The number on each contour curve denotes the probability of belonging to
the minority class. Observing that the absent data points are both at the same location
suggests that when the samples from two classes are well separated and we have a relatively
sufficient number of training samples, the absent data points do not play an important role.
Plot (a-3) in Figure 2 shows the ROC curve for this example, which has a corresponding
AUC of 99.99%.


When the minority and majority regions overlap more, the absent data points
significantly impact the SBIC algorithm. Plot (b-2) in Figure 2 shows that the locations
of the absent data points are close to the boundary of the two classes, but compared
to Toy1, they are further inside the majority region. Loosely speaking, the absent data
points try to explore the majority region so that they are positioned in an area that helps
the algorithm to better identify the boundary. Again, we note an important difference
between synthetic data points in general and absent data points: the former represent
data points from the minority class, whereas the latter help the algorithm to identify the
minority region. As such, the locations of the absent data points would not necessarily
be the same as the locations of extra samples possibly obtained from the minority class
by linear interpolation (Chawla et al., 2002); rather, they are parameters in optimization
problem (10). We adjust these parameters to optimize the algorithm's overall detection
power. As mentioned, we use the values of the weights w rather than the actual
values of the optimal absent data points in prediction. While the values of the absent data
points affect the optimal w, i.e., a solution to optimization problem (11), the deep intrusion
of absent data points into the majority region for Toy2 violates the idea of having the
absent data points lie in the minority region, as discussed in Section 3.2. This can be a
result of solving the relaxation of optimization problem (10). Our discretization approach
to find $\lambda_1$ and $\lambda_2$ for optimization problem (18),
as discussed in Section 4.2, may result in a duality gap in some cases. Furthermore, the
numerical algorithm we use to find stationary points does not guarantee global optimality.
Despite these issues, the AUC of 97.30% shown in plot (b-3) in Figure 2 indicates good
performance for SBIC.

Figure 2: Top: (a-1) Dataset Toy1, which has two separate classes. (a-2) The locations
of absent data points (two points very close to each other) and the contour plots for
the predicted probabilities; the numbers on the contour curves denote the probability of
belonging to the minority class. (a-3) The ROC curve for Toy1. Bottom: (b-1) Dataset
Toy2, which has two overlapping classes. (b-2) The locations of absent data points and the
contour plots for the predicted probabilities; the numbers on the contour curves denote the
probability of belonging to the minority class. Compared with plots (a), the absent data
points are distinct and deeper into the majority region. (b-3) The ROC curve for Toy2.

To see how SBIC performs when the datasets are absolutely imbalanced, we create
another dataset with the same majority samples and only five minority data points with
mean $[0, 0]^T$ and the same covariance matrix of the minority as in Toy1 and Toy2. Toy3,
therefore, has the same 70 majority samples (see plot (a-1) in Figure 3). We observe
that the locations of the absent data points are close to the boundary and away from
the minority data point that is within the majority region (plot (a-2) in Figure 3). This
experiment demonstrates the role of absent data points for more challenging cases, i.e.,
the absent data points try to explore the data region such that they "push" the weights
towards their optimal values. The AUC of 96.40% shown in plot (a-3) in Figure 3 indicates
good performance for SBIC.



Figure 3: (a-1) Dataset Toy3, which suffers from absolute imbalance and overlapping regions.
(a-2) The locations of absent data points and the contour plots for the predicted
probabilities; the numbers on the contour curves denote the probability of belonging to
the minority class. (a-3) The ROC curve for Toy3.


Next, we examine the effects of parameters $\Delta$ and $\delta$, which appear in the two constraints,
on the solution of SBIC. Although $\Delta$ and $\delta$ do not appear in equation (16)
and its companion equation, which determine the values of w in SBIC, they indirectly impact the
solution to these equations by determining $\lambda_1$ and $\lambda_2$ in optimization problem (18).
Therefore, instead of conducting the sensitivity analysis on the values of $\Delta$ and $\delta$, we conduct it
on $\lambda_1$ and $\lambda_2$.

Figure 4 shows the AUC for different combinations of $\lambda_1$ and $\lambda_2$ for dataset Toy3. We
produce this figure by finding the AUC for a set of $(\lambda_1, \lambda_2) \in [0, 4] \times [0, 11]$, and then
interpolating the results to get a continuous surface for illustration. Figure 4 suggests that when
both $\lambda_1$ and $\lambda_2$ are very close to zero, which means we simply perform classification using
an empirical similarity function without generating any absent data, SBIC's performance
is not good in terms of AUC. A large increase in $\lambda_1$, while $\lambda_2$ is still close to zero, will
have a minor effect on the performance, whereas if $\lambda_1$ is close to zero, increasing $\lambda_2$ will
not improve the performance. This contrast demonstrates the relative importance of the two
constraints in optimization problem (10). SBIC performs consistently
well for a large range of $\lambda_1$ and for $\lambda_2$ between 0 and 6, but its
performance deteriorates significantly for some larger values of $\lambda_2$, which is a manifestation
of the non-convexity of the objective function in the optimization problem. We note that
our cross validation technique to find $\lambda_1$ and $\lambda_2$ provides an AUC equal to 96.40%, which
is very close to the maximum value, 97.00%, on the plot. The next section discusses the
application of the proposed algorithm to real datasets.


Figure 4: Area Under Curve using SBIC for different values of $\lambda_1$ and $\lambda_2$ for dataset Toy3.


4.4 Experimental Results 


We apply SBIC to real datasets and compare its performance with Cost-sensitive Support
Vector Machine (CSSVM) (Veropoulos et al., 1999) and SMOTE (Chawla et al., 2002).
CSSVM is an SVM algorithm designed for imbalanced classification, where the formulation
enforces a stricter classification for the minority points by assigning a higher penalty
to their mis-classification in the training period. SMOTE generates synthetic minority
data points by interpolation. We then use an SVM algorithm on the balanced dataset (the
dataset obtained by adding the synthetic minority data points). Most other algorithms that
deal with imbalanced classification can be categorized into cost-sensitive approaches and
synthetic data generation. We choose CSSVM to represent the former, and SMOTE+SVM
(hereafter, SMOTE) for the latter, maintaining that CSSVM and SMOTE are sufficient
for comparing absent data generation with the two major schools of thought.
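The interpolation idea behind SMOTE can be illustrated with the simplified sketch below. It is not the reference implementation of Chawla et al. (2002): the function name, the neighbourhood size k = 5, and the random choices are assumptions for illustration. Each synthetic point lies on the segment between a minority point and one of its nearest minority neighbours.

```python
import numpy as np

def interpolate_minority(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points on segments between minority points
    and one of their k nearest minority neighbours (simplified sketch)."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    k = min(k, n - 1)
    # Pairwise squared distances among minority points.
    diff = minority[:, None, :] - minority[None, :, :]
    dist2 = np.sum(diff ** 2, axis=2)
    np.fill_diagonal(dist2, np.inf)          # a point is not its own neighbour
    neighbours = np.argsort(dist2, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)    # random base point per new sample
    pick = rng.integers(0, k, size=n_new)    # random neighbour of that base
    partner = neighbours[base, pick]
    gap = rng.random((n_new, 1))             # interpolation weight in [0, 1)
    return minority[base] + gap * (minority[partner] - minority[base])

# Placeholder minority set with Glass's shape (17 points, 9 dimensions).
points = np.random.default_rng(1).random((17, 9))
synthetic = interpolate_minority(points, n_new=180)
print(synthetic.shape)
```

Because every synthetic point is a convex combination of two observed minority points, it stays inside the region already covered by the minority sample, which is the contrast with absent data points drawn in the toy examples.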

We use nine real datasets. Five of the datasets, Breast Cancer Detection, Speech Recognition,
Yeast, Ionosphere, and Glass, are available on the UCI data repository (Lichman,
2013). The other four, Pima, E-coli, Haberman, and Vehicle, are from (bro, 2014). When a
dataset has labels for more than two classes, we randomly select one class as the minority
and aggregate the remaining classes as the majority. Since the number of parameters to
learn in SBIC is a function of the dimension of the data, and learning them involves solving
a nonlinear optimization, SBIC may not obtain a timely optimal solution
for some of the higher dimensional datasets. Thus, for Vehicle and Ionosphere, we use
Principal Component Analysis (PCA) for dimensionality reduction (Jolliffe, 2002). For the
competing algorithms, we always use the original data without dimensionality reduction.
Table 1 summarizes the properties of the datasets.

Table 1: Datasets

Dataset              original dim.   dim. used in SBIC   # of data points   # of maj.   # of min.
E-coli                     9                 9                  336            301          35
Ionosphere                34                18                  351            225         126
Yeast                     10                10                 1484           1449          35
Glass                      9                 9                  214            197          17
Speech Recognition        10                10                  990            900          90
Haberman                   3                 3                  306            225          81
Vehicle                   18                 9                  846            634         212
Breast Cancer              9                 9                  699            458         241
Pima                       8                 8                  768            500         268
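The PCA step used for Vehicle and Ionosphere can be sketched with a singular value decomposition. This is a generic illustration, not the authors' code: the function name is hypothetical, and the random matrix only mimics the shape of the Ionosphere data (351 points, 34 dimensions reduced to 18).

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project centred data onto its leading principal directions."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(351, 34))   # placeholder with Ionosphere's shape, not the real data
Z = pca_reduce(X, n_components=18)
print(Z.shape)
```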


For each dataset we use five-fold cross validation. Therefore, for each algorithm we 
obtain 5 AUC values for each dataset. Since we perform cross validation, the imbalance 
ratio, i.e. the ratio of the number of majority points to the number of minority points, in 
each training dataset is fixed. Unlike our previous study (Pourhabib et al., 2015), we do not 
create datasets that are “absolutely imbalanced”, i.e., the training dataset is imbalanced 
and contains too few data points. As such, maintaining the same number of minority data 
points in each training case results in better AUCs, compared with (Pourhabib et al., 2015) 
which has some training cases containing only a few samples of minority points. 
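One way to keep the imbalance ratio nearly constant across training sets is to split each class separately into five folds. The sketch below assumes such a stratified split, which the text implies but does not spell out; the function name and seed are hypothetical, and the class sizes are those of E-coli in Table 1.

```python
import numpy as np

def stratified_folds(labels, n_folds=5, seed=0):
    """Return n_folds index arrays, each with roughly the same class mix."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    folds = [[] for _ in range(n_folds)]
    for cls in np.unique(labels):
        # Shuffle the indices of this class and deal them into the folds.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        for f, chunk in enumerate(np.array_split(idx, n_folds)):
            folds[f].extend(chunk.tolist())
    return [np.array(sorted(f)) for f in folds]

labels = np.array([0] * 301 + [1] * 35)   # 0 = majority, 1 = minority
folds = stratified_folds(labels)
for test_idx in folds:
    train = np.delete(labels, test_idx)
    ratio = np.sum(train == 0) / np.sum(train == 1)
    print(f"training imbalance ratio: {ratio:.2f}")
```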

Table 2 summarizes the values of parameters used for each dataset. To find the Lagrangian
coefficients, through evaluating $R(\lambda_{1p}, \lambda_{2q})$ in equation (25), we use the candidate
sets $\Lambda_1 = \{0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35\}$ and $\Lambda_2 = \{0.5, 1, 3, 5, 7, 9, 11\}$, which means
$n_c = 7$. The larger candidate values chosen for $\lambda_2$ reflect the need to penalize the violation
of the constraint associated with $\lambda_2$ more than that of the constraint associated with $\lambda_1$,
to ensure the absent data points are close to the
boundary of the two classes. Simply put, we do not want the existing minority points to
lie between absent data and the majority points, but intend to have the absent data lie
between the minority and majority points. The choice of the same candidate sets $\Lambda_1$ and
$\Lambda_2$ for all the datasets is justified by the fact that we normalize the input data so that
each $x_i$ is bounded by 1 for $i = 1, 2, \ldots, n$ in all datasets. The values reported in Table 2 are the average
values of $\lambda_1$ and $\lambda_2$ over all undersampled datasets $D_\ell$ for $\ell = 1, \ldots, U$. Hence, some of
the values are not among the candidate values in $\Lambda_1$ or $\Lambda_2$. We determine the values of $\Delta$
and $\delta$ according to (26) and (27); refer to Section 4.2 for the determination of K and U.
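The exhaustive search over the two candidate sets can be sketched as a plain grid search. The candidate values are the ones listed above; everything else is an assumption for illustration: `evaluate` stands in for the function R of equation (25), `toy_objective` is a made-up placeholder rather than the SBIC objective, and selecting the smallest value (the tightest upper bound) is one plausible reading of the selection rule.

```python
from itertools import product
from statistics import mean

LAMBDA_1 = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35]
LAMBDA_2 = [0.5, 1, 3, 5, 7, 9, 11]

def grid_search(evaluate, candidates_1=LAMBDA_1, candidates_2=LAMBDA_2):
    """Return the (lambda_1, lambda_2) pair with the smallest evaluate() value.
    Assumption: smaller is better (tightest upper bound)."""
    return min(product(candidates_1, candidates_2),
               key=lambda pair: evaluate(*pair))

def toy_objective(lam1, lam2):
    # Hypothetical stand-in for R; NOT the SBIC objective.
    return (lam1 - 0.1) ** 2 + (lam2 - 7) ** 2

best = grid_search(toy_objective)
print(best)  # (0.1, 7)

# Averaging the selected values over several undersampled datasets can give
# a number outside the candidate set (hypothetical selections shown here).
print(round(mean([7, 9, 9]), 1))  # 8.3
```

The search evaluates all 7 x 7 = 49 pairs for each undersampled dataset, so its cost grows with both $n_c^2$ and U.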

Figure 5 presents the average ROCs of the algorithms for each dataset. We obtain
each average ROC by averaging the five curves, each associated with one test case (since
we do five-fold cross validation). Recall from Section 4.1 that a good way to summarize
the information in an ROC curve is to report the area under curve (AUC). Table 3 lists
the average values of AUC and their standard deviations.
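Averaging ROC curves requires interpolating them onto a common grid of false-alarm rates, so the area under the averaged curve need not equal the average of the individual AUCs. The sketch below illustrates this with two made-up curves; the function names, the grid sizes, and the curves are assumptions, not results from this study.

```python
import numpy as np

def trapezoid_area(x, y):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

def average_roc(curves, grid_size=101):
    """Interpolate each (false_alarm, detection) curve onto a common grid
    and average the detection rates pointwise."""
    grid = np.linspace(0.0, 1.0, grid_size)
    interpolated = [np.interp(grid, fa, det) for fa, det in curves]
    return grid, np.mean(interpolated, axis=0)

# Two made-up ROC curves (illustrative only).
curves = [([0.0, 0.5, 1.0], [0.0, 0.9, 1.0]),
          ([0.0, 0.25, 1.0], [0.0, 0.6, 1.0])]

mean_of_aucs = float(np.mean([trapezoid_area(fa, det) for fa, det in curves]))
grid, mean_curve = average_roc(curves, grid_size=4)   # deliberately coarse grid
auc_of_mean_curve = trapezoid_area(grid, mean_curve)
print(mean_of_aucs, auc_of_mean_curve)
```

With a fine grid the two quantities nearly coincide, while a coarse grid makes the interpolation error visible.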

To further illustrate the performance of SBIC, Table 3 presents results for two other
algorithms: (1) classification using only an empirical similarity function (ESF) (Gilboa et al.,
2006) and (2) absent data generation using Fisher discriminant analysis (ADGFDA) (Pourhabib
et al., 2015). ESF represents an application of the empirical similarity without utilizing any
absent data. We include ESF to determine whether the inclusion of absent data generation can
enhance the performance of a classifier merely based on empirical similarity. We include
ADGFDA to compare SBIC with another algorithm that utilizes absent data generation,
but inside a different framework, namely Fisher discriminant analysis.

Table 2: Parameters in SBIC

Dataset              $\Delta$    $\delta$     average $\lambda_1$   average $\lambda_2$     K     U
E-coli               55.3     22.8         0.08              9.0          56     8
Ionosphere           87.4     49.8         0.30              5.0         180     1
Yeast                 3.5      2.3         0.05              3.0          56    10
Glass                 6.6      2.0         0.10              9.0          50     2
Speech Recognition   13.2      3.6         0.10              7.0         144     3
Haberman              0.4      7.6         0.10              9.0         130     3
Vehicle              58.1      5.5         0.08              8.3         338     3
Breast Cancer        45.6    157.9         0.20              0.5         355     2
Pima                 18.5     29.5         0.05             11.0         400     2

The results suggest that SBIC outperforms both CSSVM and SMOTE for E-coli, Yeast,
Breast Cancer Detection, and Pima. SBIC is competitive with CSSVM and SMOTE for
Speech Recognition, but it performs poorly for Ionosphere, Vehicle, and Haberman. ESF
also performs poorly on most datasets, unless the dataset is not highly imbalanced (Breast
Cancer Detection) or the classes are well separated (Speech Recognition). We conclude
that SBIC's performance can be attributed mostly to absent data generation rather than
to the use of an empirical similarity function.

Comparing ADGFDA with SBIC, on the other hand, does not provide a straightforward
conclusion. For some datasets (Yeast, E-coli), ADGFDA and SBIC outperform
the competing algorithms, whereas for Ionosphere and Vehicle, ADGFDA and SBIC do
not outperform CSSVM and SMOTE. The results suggest that for the former subset of
datasets, absent data generation can improve the mis-classification rate, and for the latter,
absent data generation can be inadequate (as opposed to cost-sensitive approaches or synthetic data
generation). In other words, absent data generation may not help improve a classifier's
performance for some imbalanced data structures.

Another group of datasets (Pima, Glass) shows a discrepancy between the performance
of ADGFDA and SBIC. We attribute the discrepancy to the different base classifiers,
namely Fisher discriminant analysis in ADGFDA and empirical similarity in SBIC. The results
suggest that Fisher discriminant analysis may be better suited to some data structures
than an empirical similarity function.

Note that the average ROCs are obtained by averaging the curves (which involves
interpolation and therefore approximation), whereas the average AUCs reported in Table 3
are obtained by averaging the AUCs under the five curves for each dataset. As such, there
might be a slight difference between the AUC of the average curve shown in Figure 5 and the average
AUCs reported in Table 3. For most cases, however, the difference is insignificant.


Figure 5: Average ROCs of CSSVM, SMOTE, and SBIC for datasets described in Table 1.
(Panel titles recoverable from this copy: E-Coli, Ionosphere, Yeast; horizontal axis: FA.)


Acknowledging that SBIC outperforms the competing methods for some datasets, we now
need to determine whether the results are statistically significant based on the nine datasets.
Using the data reported in Table 3, we conduct a posthoc analysis using the Friedman
test (Demsar, 2006) to rank the algorithms. We let $m_d$ denote the number of test sets
and $m_a$ denote the number of algorithms. We define R as an $m_d \times m_a$ matrix whose
(i, j)th entry denotes the AUC of algorithm j for the test set i, where $j = 1, \ldots, m_a$ and
$i = 1, \ldots, m_d$. Based on the data in matrix R we create another matrix Q of the same
size whose (i, j)th entry denotes the rank of algorithm j for the test set i, i.e., each
row of the matrix Q contains the ranks of the algorithms for that test set, where the best
algorithm has rank $m_a$ and the worst has rank 1. We let q denote an $m_a \times 1$ vector whose
$\ell$th entry $q(\ell)$ is the average value of the $\ell$th column of Q. Under the null hypothesis that
all algorithms are equivalent, in the sense that for a given dataset they produce the
same AUC, the Friedman statistic




$$\chi_F^2 = \frac{12 m_d}{m_a (m_a + 1)} \left[ \sum_{\ell=1}^{m_a} q(\ell)^2 - \frac{m_a (m_a + 1)^2}{4} \right] \qquad (28)$$




follows a Chi-squared distribution with $m_a - 1$ degrees of freedom. Here, we have $m_a = 5$
algorithms and nine datasets, but since we do five-fold cross validation, we have $m_d =
9 \times 5 = 45$ test sets. Therefore, matrix R is 45 x 5. That is, R is the expanded form
of the results in Table 3, where each row in the table is expanded into five rows of the
matrix R. Table 4 presents the average rankings based on the Friedman test, where 5 is
the ranking of the best algorithm. Figure 6 displays the posthoc analysis on the results of
the test. According to Table 4, the average ranking for ADGFDA is the highest among the
five algorithms, and SBIC and CSSVM are tied for second place. Figure 6 shows that the
differences among CSSVM, SMOTE, ADGFDA, and SBIC are not statistically significant
(based on the nine datasets used and five-fold cross validation). In fact, the only significant
result is that ESF performs the worst.

Table 3: Average AUCs. The numbers in parentheses are standard deviations of five folds.

Dataset              CSSVM          SMOTE          ESF            ADGFDA         SBIC
E-coli               81.58 (8.7)    79.82 (8.2)    76.39 (8.3)    83.11 (9.1)    88.59 (5.5)
Ionosphere           94.24 (2.0)    94.24 (2.1)    84.38 (5.54)   90.5 (1.4)     90.85 (7.3)
Yeast                72.05 (17.3)   77.05 (12.9)   79.38 (5.5)    89.42 (11.7)   89.76 (4.7)
Glass                84.63 (19.2)   85.81 (11.1)   69.5 (10.9)    87.82 (13.6)   74.4 (11.7)
Speech Recognition   98.86 (0.7)    98.66 (0.4)    98.79 (0.01)   99.08 (0.87)   98.38 (1.11)
Haberman             68.03 (3.9)    67.62 (3.1)    60.76 (7.9)    69.23 (7.6)    63.79 (7.1)
Vehicle              84.57 (4.3)    84.38 (4.1)    67.71 (5.9)    79.49 (4.2)    73.73 (6.7)
Breast Cancer        98.86 (0.7)    99.06 (1.02)   98.74 (0.5)    99.33 (0.7)    99.29 (0.5)
Pima                 81.42 (1.4)    81.42 (1.4)    75.18 (3.1)    74.01 (5.3)    82.39 (2.5)

Table 4: Ranking mean for the algorithms based on Friedman test

Algorithm       CSSVM   SMOTE   ESF    ADGFDA   SBIC
Ranking Mean    3.24    3.13    2.07   3.31     3.24
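The ranking procedure and the Friedman statistic described above can be sketched as follows. The code assumes that equation (28) has the standard form given by Demsar (2006), ranks the best algorithm highest as in the text, and gives tied algorithms their average rank (the tie rule is an assumption). The function name and the small AUC matrix are made up for illustration.

```python
import numpy as np

def friedman_statistic(auc):
    """Return (mean ranks q, Friedman statistic) for an m_d x m_a AUC matrix."""
    auc = np.asarray(auc, dtype=float)
    m_d, m_a = auc.shape
    # Rank within each row: the largest AUC gets rank m_a, the smallest rank 1;
    # tied entries share the average of the ranks they span.
    smaller = (auc[:, :, None] > auc[:, None, :]).sum(axis=2)
    equal = (auc[:, :, None] == auc[:, None, :]).sum(axis=2)
    ranks = smaller + (equal + 1) / 2.0
    q = ranks.mean(axis=0)
    stat = 12.0 * m_d / (m_a * (m_a + 1)) * (
        np.sum(q ** 2) - m_a * (m_a + 1) ** 2 / 4.0)
    return q, stat

# Made-up AUCs for 3 test sets (rows) and 3 algorithms (columns).
example = [[0.90, 0.80, 0.70],
           [0.85, 0.80, 0.75],
           [0.95, 0.70, 0.90]]
q, stat = friedman_statistic(example)
print(q, stat)
```

The statistic is then compared with a Chi-squared distribution with $m_a - 1$ degrees of freedom; when all algorithms receive the same mean rank, the statistic is zero.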


Figure 6: Posthoc analysis on the ranking data. The bars denote approximately 95%
confidence intervals. (Horizontal axis: average rank; rows: CSSVM, SMOTE, ESF, ADGFDA, SBIC.)

In summary, despite not being the statistically superior algorithm in this study, SBIC
does outperform competing algorithms, in some cases quite remarkably, on some of the
datasets reported in Table 3. This shows that SBIC has some merits for imbalanced classification
and can be considered a viable alternative, at least for some data structures, to
traditional state-of-the-art algorithms such as CSSVM.


5 Conclusion 


Imbalanced classification is of paramount importance in applications such as quality
control, healthcare informatics, and warranty claims. This paper has proposed an absent data
generation mechanism based on empirical similarity for imbalanced classification. The
approach falls in the category of synthetic data generation mechanisms that are embedded
in the classification algorithms, namely absent data generation. The proposed algorithm,
SBIC, does not actually generate synthetic data, but instead utilizes their properties to
identify the weights of an empirical similarity function.

We formulated the imbalanced classification problem as a constrained optimization
framework and used numerical techniques to find the solution. Based on empirical studies
of nine real datasets, we found that SBIC outperformed the other commonly used
algorithms for some datasets. A failure to outperform was attributed to the fact that absent
data generation does not necessarily improve a classifying algorithm's prediction power,
or to the specific mechanism for absent data generation employed in SBIC. SBIC was also
limited by the "manual" selection of some parameters, such as $\delta$, $\Delta$, or T, which suggested
that an automated approach for selecting parameters could potentially improve algorithmic
performance.

The limitations above suggest four paths for future research on SBIC. First, the
imbalanced classification literature would benefit from a thorough study that determines the
applicability of synthetic data generation, in general, and absent data generation, in
particular, to imbalanced datasets. Our review of the published literature found that studies
focus primarily on empirical results, whereas establishing a theoretical foundation that
connects the data structure to the algorithms would provide insights into improving the
design of the SBIC algorithm for imbalanced classification. Second, SBIC should be tested
on more absolutely imbalanced datasets for which we have only a few samples from the
minority class, by either exploring other datasets or creating training datasets through
undersampling (Pourhabib et al., 2015). Third, the application of variable-bandwidth
kernels (Giannakis and Majda, 2012) to imbalanced classification may prove useful
because the kernels tend to be more stable in regions with low
sample density. Fourth, since the specific structure of spatio-temporal data hinders a
direct application of absent data generation techniques, it would be worthwhile to determine
the applicability of imbalanced classification techniques to spatio-temporal datasets. From
a data mining perspective, however, rare events in spatio-temporal systems (Giannakis
and Majda, 2012) can be categorized as minority data points. Extending similarity-based
absent data generation to such problems, while not straightforward, should be an ongoing
pursuit.